Images can be segmented by first using a classifier to predict an affinity graph
that reflects the degree to which image pixels must be grouped together and then
partitioning the graph to yield a segmentation. Machine learning has been applied
to the affinity classifier to produce affinity graphs that are good in the sense of
minimizing edge misclassification rates. However, this error measure is only
indirectly related to the quality of segmentations produced by ultimately partitioning
the affinity graph. We present the first machine learning algorithm for training a
classifier to produce affinity graphs that are good in the sense of producing
segmentations that directly minimize the Rand index, a well-known segmentation
performance measure.

The Rand index measures segmentation performance by quantifying the
classification of the connectivity of image pixel pairs after segmentation. By using the
simple graph partitioning algorithm of finding the connected components of the 
thresholded affinity graph, we are able to train an affinity classifier to directly 
minimize the Rand index of segmentations resulting from the graph partitioning. 
Our learning algorithm corresponds to the learning of maximin affinities between 
image pixel pairs, which are predictive of the pixel-pair connectivity. 



1 Introduction 

Supervised learning has emerged as a serious contender in the field of image segmentation, ever 
since the creation of training sets of images with "ground truth" segmentations provided by humans, 
such as the Berkeley Segmentation Dataset [15]. Supervised learning requires 1) a parametrized
algorithm that maps images to segmentations, 2) an objective function that quantifies the performance
of a segmentation algorithm relative to ground truth, and 3) a means of searching the parameter space 
of the segmentation algorithm for an optimum of the objective function. 

In the supervised learning method presented here, the segmentation algorithm consists of a 
parametrized classifier that predicts the weights of a nearest neighbor affinity graph over image 
pixels, followed by a graph partitioner that thresholds the affinity graph and finds its connected 
components. Our objective function is the Rand index [18], which has recently been proposed as a
quantitative measure of segmentation performance [23]. We "soften" the thresholding of the
classifier output and adjust the parameters of the classifier by gradient learning based on the Rand index.

Figure 1: (left) Our segmentation algorithm. We first generate a nearest neighbor weighted
affinity graph representing the degree to which nearest neighbor pixels should be grouped together. The
segmentation is generated by finding the connected components of the thresholded affinity graph.
(right) Affinity misclassification rates are a poor measure of segmentation performance.
Affinity graph #1 makes only 1 error (dashed edge) but results in poor segmentations, while graph #2
generates a perfect segmentation despite making many affinity misclassifications (dashed edges).




Because maximin edges of the affinity graph play a key role in our learning method, we call it
maximin affinity learning of image segmentation, or MALIS. The minimax path and edge are standard
concepts in graph theory, and maximin is the opposite-sign sibling of minimax. Hence our work can 
be viewed as a machine learning application of these graph theoretic concepts. MALIS focuses on 
improving classifier output at maximin edges, because classifying these edges incorrectly leads to 
genuine segmentation errors, the splitting or merging of segments. 

To the best of our knowledge, MALIS is the first supervised learning method that is based on
optimizing a genuine measure of segmentation performance. The idea of training a classifier to predict
the weights of an affinity graph is not novel. Affinity classifiers were previously trained to minimize
the number of misclassified affinity edges [9, 16]. This is not the same as optimizing segmentations
produced by partitioning the affinity graph. There have been attempts to train affinity classifiers to
produce good segmentations when partitioned by normalized cuts [17, 2]. But these approaches do
not optimize a genuine measure of segmentation performance such as the Rand index. The work of
Bach and Jordan [2] is the closest to our work. However, they only minimize an upper bound to a
renormalized version of the Rand index. Both approaches require many approximations to make the
learning tractable.

In other related work, classifiers have been trained to optimize performance at detecting image pixels 
that belong to object boundaries [16, 6, 14]. Our classifier can also be viewed as a boundary detector,
since a nearest neighbor affinity graph is essentially the same as a boundary map, up to a sign 
inversion. However, we combine our classifier with a graph partitioner to produce segmentations. 
The classifier parameters are not trained to optimize performance at boundary detection, but to 
optimize performance at segmentation as measured by the Rand index. 

There are also methods for supervised learning of image labeling using Markov or conditional
random fields [10]. But image labeling is more similar to multi-class pixel classification than to
image segmentation, as the latter task may require distinguishing between multiple objects in a
single image that all have the same label.

In the cases where probabilistic random field models have been used for image parsing and
segmentation, the models have either been simplistic for tractability reasons [12] or have been trained
piecemeal. For instance, Tu et al. [22] separately train low-level discriminative modules based on a
boosting classifier, and train high-level modules of their algorithm to model the joint distribution of 
the image and the labeling. These models have never been trained to minimize the Rand index. 



2 Partitioning a thresholded affinity graph by connected components 

Our class of segmentation algorithms is constructed by combining a classifier and a graph partitioner 
(see Figure 1). The classifier is used to generate the weights of an affinity graph. The nodes of the
graph are image pixels, and the edges are between nearest neighbor pairs of pixels. The weights of
the edges are called affinities. A high affinity means that the two pixels tend to belong to the same
segment. The classifier computes the affinity of each edge based on an image patch surrounding the
edge.

The graph partitioner first thresholds the affinity graph by removing all edges with weights less
than some threshold value $\theta$. The connected components of this thresholded affinity graph are the
segments of the image.
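
The partitioning step described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in our experiments: it assumes pixels are indexed 0..N-1, that the graph is given as a list of edges with one affinity per edge, and it uses a union-find forest; the function name `connected_components` is hypothetical.

```python
def connected_components(num_pixels, edges, affinities, threshold):
    """Label pixels by connected components of the thresholded affinity graph.

    edges: list of (k, l) pixel-index pairs; affinities: one weight per edge.
    Edges with affinity below `threshold` are removed before labeling.
    """
    parent = list(range(num_pixels))

    def find(x):
        # Path-halving find for the union-find forest.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (k, l), a in zip(edges, affinities):
        if a >= threshold:  # keep only edges at or above the threshold
            rk, rl = find(k), find(l)
            if rk != rl:
                parent[rk] = rl

    roots = [find(p) for p in range(num_pixels)]
    # Relabel roots as consecutive integers in order of first appearance.
    relabel = {}
    return [relabel.setdefault(r, len(relabel)) for r in roots]


# A chain of five pixels; the weak edge (2, 3) is cut at threshold 0.5.
chain_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
chain_labels = connected_components(5, chain_edges, [0.9, 0.8, 0.2, 0.7], 0.5)
```

Lowering the threshold below 0.2 in this toy chain merges everything into a single segment, which is the kind of single-edge sensitivity discussed next.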

For this class of segmentation algorithms, it is clear that a single misclassified edge of the affinity
graph can dramatically alter the resulting segmentation by splitting or merging two segments (see
Fig. 1). This is why it is important to learn by optimizing a measure of segmentation performance
rather than affinity prediction.

We are well aware that connected components is an exceedingly simple method of graph
partitioning. More sophisticated algorithms, such as spectral clustering [20] or graph cuts [3], might be more
robust to misclassifications of one or a few edges of the affinity graph. Why not use them instead?
We have two replies to this question. 

First, because of the simplicity of our graph partitioning, we can derive a simple and direct method 
of supervised learning that optimizes a true measure of image segmentation performance. So far 
learning based on more sophisticated graph partitioning methods has fallen short of this goal [17, 2].

Second, even if it were possible to properly learn the affinities used by more sophisticated graph 
partitioning methods, we would still prefer our simple connected components. The classifier in 
our segmentation algorithm can also carry out sophisticated computations, if its representational 
power is sufficiently great. Putting the sophistication in the classifier has the advantage of making it 
learnable, rather than hand-designed. 

The sophisticated partitioning methods clean up the affinity graph by using prior assumptions about 
the properties of image segmentations. But these prior assumptions could be incorrect. The spirit of 
the machine learning approach is to use a large amount of training data and minimize the use of prior 
assumptions. If the sophisticated partitioning methods are indeed the best way of achieving good 
segmentation performance, we suspect that our classifier will learn them from the training data. If 
they are not the best way, we hope that our classifier will do even better. 

3 The Rand index quantifies segmentation performance 

Image segmentation can be viewed as a special case of the general problem of clustering, as image
segments are clusters of image pixels. Long ago, Rand proposed an index of similarity between two
clusterings [18]. Recently it has been proposed that the Rand index be applied to image
segmentations [23]. Define a segmentation $S$ as an assignment of a segment label $s_i$ to each pixel $i$. The
indicator function $\delta(s_i, s_j)$ is 1 if pixels $i$ and $j$ belong to the same segment ($s_i = s_j$) and 0 otherwise.

Given two segmentations $S$ and $\hat{S}$ of an image with $N$ pixels, define the function

$$1 - RI(\hat{S}, S) = \binom{N}{2}^{-1} \sum_{i<j} \left| \delta(s_i, s_j) - \delta(\hat{s}_i, \hat{s}_j) \right| \tag{1}$$

which is the fraction of image pixel pairs on which the two segmentations disagree. We will refer to
the function $1 - RI(\hat{S}, S)$ as the Rand index, although strictly speaking the Rand index is $RI(\hat{S}, S)$,
the fraction of image pixel pairs on which the two segmentations agree. In other words, the Rand
index is a measure of similarity, but we will often apply that term to a measure of dissimilarity.
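
For concreteness, the disagreement fraction can be computed directly from its definition. The sketch below is illustrative only: it loops over all pixel pairs, so it is quadratic in the number of pixels and suited to small examples, and the function name `rand_disagreement` is hypothetical.

```python
from itertools import combinations


def rand_disagreement(seg_a, seg_b):
    """Fraction of pixel pairs on which two segmentations disagree (1 - RI).

    seg_a, seg_b: sequences of segment labels, one per pixel (at least 2 pixels).
    Only co-membership matters, so the label values themselves are arbitrary.
    """
    assert len(seg_a) == len(seg_b)
    n = len(seg_a)
    disagreements = 0
    for i, j in combinations(range(n), 2):
        same_a = seg_a[i] == seg_a[j]
        same_b = seg_b[i] == seg_b[j]
        disagreements += same_a != same_b
    return disagreements / (n * (n - 1) / 2)


# Merging two 2-pixel segments into one: 4 of the 6 pixel pairs are now wrong.
merged_error = rand_disagreement([0, 0, 1, 1], [0, 0, 0, 0])
```

Because only pair connectivity enters the sum, relabeling the segments of either input leaves the value unchanged.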

In this paper, the Rand index is applied to compare the output $\hat{S}$ of a segmentation algorithm with a
ground truth segmentation $S$, and will serve as an objective function for learning. Figure 1 illustrates
why the Rand index is a sensible measure of segmentation performance. The segmentation of affinity 
graph #1 incurs a huge Rand index penalty relative to the ground truth. A single wrongly classified 
edge of the affinity graph leads to an incorrect merger of two segments, causing many pairs of 
image pixels to be wrongly assigned to the same segment. On the other hand, the segmentation 
corresponding to affinity graph #2 has a perfect Rand index, even though there are misclassifications 
in the affinity graph. In short, the Rand index makes sense because it strongly penalizes errors in the 
affinity graph that lead to split and merger errors. 

Figure 2: The Rand index quantifies segmentation performance by comparing the difference in
pixel pair connectivity between the groundtruth and test segmentations. Pixel pair
connectivities can be visualized as symmetric binary block-diagonal matrices $\delta(s_i, s_j)$. Each diagonal block
corresponds to connected pixel pairs belonging to one of the image segments. The Rand index incurs
penalties when pixel pairs that must not be connected are connected or vice versa. This corresponds
to locations where the two matrices disagree. An erroneous merger of two groundtruth segments
incurs a penalty proportional to the product of the sizes of the two segments. Split errors are similarly
penalized.
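
As a worked instance of the merger penalty, consider a hypothetical image with $N = 100$ pixels whose ground truth contains two segments of $n_a = n_b = 50$ pixels. The numbers are chosen purely for illustration. If the two segments are erroneously merged, every cross pair is misclassified, so

$$1 - RI = \frac{n_a n_b}{\binom{N}{2}} = \frac{50 \cdot 50}{4950} \approx 0.505 .$$

By contrast, wrongly merging a single pixel ($n_b = 1$) into the 50-pixel segment costs only $50 / 4950 \approx 0.010$, which is why mergers of large segments dominate the penalty.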



4 Connectivity and maximin affinity 

Recall that our segmentation algorithm works by finding connected components of the thresholded 
affinity graph. Let $\hat{S}$ be the segmentation produced in this way. To apply the Rand index to train
our classifier, we need a simple way of relating the indicator function $\delta(\hat{s}_i, \hat{s}_j)$ in the Rand index
to classifier output. In other words, we would like a way of characterizing whether two pixels are 
connected in the thresholded affinity graph. 

To do this, we introduce the concept of maximin affinity, which is defined for any pair of pixels in an
affinity graph (the definition is generally applicable to any weighted graph). Let $A_{kl}$ be the affinity
of pixels $k$ and $l$. Let $\mathcal{P}_{ij}$ be the set of all paths in the graph that connect pixels $i$ and $j$. For every
path $P$ in $\mathcal{P}_{ij}$, there is an edge (or edges) with minimal affinity. This is written as $\min_{\langle k,l \rangle \in P} A_{kl}$,
where $\langle k,l \rangle \in P$ means that the edge between pixels $k$ and $l$ is in the path $P$.

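A maximin path $P^*_{ij}$ is a path between pixels $i$ and $j$ that maximizes the minimal affinity,

$$P^*_{ij} = \arg\max_{P \in \mathcal{P}_{ij}} \min_{\langle k,l \rangle \in P} A_{kl} \tag{2}$$

The maximin affinity of pixels $i$ and $j$ is the affinity of the maximin edge, or the minimal affinity of
the maximin path,

$$A^*_{ij} = \max_{P \in \mathcal{P}_{ij}} \min_{\langle k,l \rangle \in P} A_{kl} \tag{3}$$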

We are now ready for a trivial but important theorem. 

Theorem 1. A pair of pixels is connected in the thresholded affinity graph if and only if their 
maximin affinity exceeds the threshold value. 

Proof By definition, a pixel pair is connected in the thresholded affinity graph if and only if there 
exists a path between them. Such a path is equivalent to a path in the unthresholded affinity graph 
for which the minimal affinity is above the threshold value. This path in turn exists if and only if the 
maximin affinity is above the threshold value. □ 

As a consequence of this theorem, pixel pairs can be classified as connected or disconnected by 
thresholding maximin affinities. Let $\hat{S}$ be the segmentation produced by thresholding the affinity
graph $A_{ij}$ and then finding connected components. Then the connectivity indicator function is

$$\delta(\hat{s}_i, \hat{s}_j) = H(A^*_{ij} - \theta) \tag{4}$$

where H is the Heaviside step function. 

Maximin affinities can be computed efficiently using minimum spanning tree algorithms [8]. A
maximum spanning tree is equivalent to a minimum spanning tree, up to a sign change of the weights.
Any path in a maximum spanning tree is a maximin path. For our nearest neighbor affinity graphs,
the maximin affinity of a pixel pair can be computed in $O(|E| \cdot \alpha(|V|))$ time, where $|E|$ is the number of
graph edges, $|V|$ is the number of pixels, and $\alpha(\cdot)$ is the inverse Ackermann function, which grows
sub-logarithmically. The full matrix $A^*_{ij}$ can be computed in time $O(|V|^2)$ since the computation
can be shared. Note that maximin affinities are required for training, but not testing. For segmenting
the image at test time, only a connected components computation need be performed, which takes
time linear in the number of edges $|E|$.
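
One way to obtain the full maximin affinity matrix is a Kruskal-style maximum spanning tree construction, sketched below under stated assumptions. It is an illustration, not the code used in our experiments: pixels are indexed 0..N-1, the graph is an edge list, the function name `maximin_affinities` is hypothetical, and the diagonal convention and the value for disconnected pairs are choices made here for convenience.

```python
import numpy as np


def maximin_affinities(num_pixels, edges, affinities):
    """All-pairs maximin affinity matrix via Kruskal's maximum spanning tree.

    Edges are processed in order of decreasing affinity. When an edge joins two
    previously separate components, it is the minimal edge on the tree path
    between every cross pair, so its affinity is their maximin affinity.
    Pairs with no connecting path keep the value -inf (an assumed convention).
    """
    members = {p: [p] for p in range(num_pixels)}  # component root -> pixels
    root = list(range(num_pixels))                 # pixel -> component root
    a_star = np.full((num_pixels, num_pixels), -np.inf)
    np.fill_diagonal(a_star, np.inf)  # assumed: a pixel is connected to itself

    for e in np.argsort(affinities)[::-1]:
        k, l = edges[e]
        rk, rl = root[k], root[l]
        if rk == rl:
            continue  # edge closes a cycle, so it is not a tree edge
        if len(members[rk]) < len(members[rl]):
            rk, rl = rl, rk  # merge the smaller component into the larger
        for i in members[rk]:
            for j in members[rl]:
                a_star[i, j] = a_star[j, i] = affinities[e]
        for j in members[rl]:
            root[j] = rk
        members[rk].extend(members.pop(rl))
    return a_star


# A 4-cycle: the weak edge (1, 2) is bypassed by the path 1-0-3-2.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
a_star = maximin_affinities(4, cycle_edges, [0.9, 0.2, 0.7, 0.6])
```

In line with Theorem 1, comparing the entries of `a_star` with a threshold gives the same pixel-pair connectivity as thresholding the graph and finding connected components.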



5 Optimizing the Rand index by learning maximin affinities 

Since the affinities and maximin affinities are both functions of the image $I$ and the classifier
parameters $W$, we will write them as $A_{ij}(I; W)$ and $A^*_{ij}(I; W)$, respectively. By Eq. (4) of the previous
section, the Rand index of Eq. (1) takes the form

$$1 - RI(S, I; W) = \binom{N}{2}^{-1} \sum_{i<j} \left| \delta(s_i, s_j) - H(A^*_{ij}(I; W) - \theta) \right|$$



Since this is a discontinuous function of the maximin affinities, we make the usual relaxation by
replacing $|\delta(s_i, s_j) - H(A^*_{ij}(I; W) - \theta)|$ with a continuous loss function $l(\delta(s_i, s_j), A^*_{ij}(I; W))$.

Any standard loss, such as the square loss $\frac{1}{2}(x - \hat{x})^2$ or the hinge loss, can be used for
$l(x, \hat{x})$. Thus we obtain a cost function suitable for gradient learning.



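$$E(S, I; W) = \binom{N}{2}^{-1} \sum_{i<j} l\left( \delta(s_i, s_j), \max_{P \in \mathcal{P}_{ij}} \min_{\langle k,l \rangle \in P} A_{kl}(I; W) \right) \tag{5}$$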



The max and min operations are continuous and differentiable almost everywhere (though not continuously
differentiable). If the loss function $l$ is smooth, and the affinity $A_{kl}(I; W)$ is a smooth function, then the
gradient of the cost function is well-defined, and gradient descent can be used as an optimization 
method. 

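Define $\langle k,l \rangle = mm(i,j)$ to be the maximin edge for the pixel pair $(i,j)$. If there is a tie, choose
between the maximin edges at random. Then the cost function takes the form

$$E(S, I; W) = \binom{N}{2}^{-1} \sum_{i<j} l\left( \delta(s_i, s_j), A_{mm(i,j)}(I; W) \right)$$
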
It is instructive to compare this with the cost function for standard affinity learning

$$E_{standard}(S, I; W) = \frac{2}{cN} \sum_{\langle i,j \rangle} l\left( \delta(s_i, s_j), A_{ij}(I; W) \right)$$

where the sum is over all nearest neighbor pixel pairs $\langle i,j \rangle$ and $c$ is the number of nearest neighbors
[9]. In contrast, the sum in the MALIS cost function is over all pairs of pixels, whether or not they
are adjacent in the affinity graph. Note that a single edge can be the maximin edge for multiple pairs 
of pixels, so its affinity can appear multiple times in the MALIS cost function. Roughly speaking, 
the MALIS cost function is similar to the standard cost function, except that each edge in the affinity 
graph is weighted by the number of pixel pairs that it causes to be incorrectly classified. 
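
The edge weighting described above can be made explicit by counting, for every edge, the pixel pairs for which it is the maximin edge. The sketch below is one possible batch computation and is an assumption on our part rather than the procedure used for training: it reuses a Kruskal-style merge, breaks ties by sort order instead of at random, assumes a connected graph, and the names `maximin_pair_counts` and `malis_cost` are hypothetical.

```python
from collections import Counter


def maximin_pair_counts(labels, edges, affinities):
    """For each edge, count the pixel pairs whose maximin edge it is.

    Returns {edge_index: (n_same, n_diff)}, splitting the pairs by whether the
    ground-truth `labels` put the two pixels in the same segment.
    """
    n = len(labels)
    root = list(range(n))
    # Ground-truth label histogram of each component, indexed by its root.
    hist = [Counter({labels[p]: 1}) for p in range(n)]

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    counts = {}
    for e in sorted(range(len(edges)), key=lambda idx: -affinities[idx]):
        rk, rl = find(edges[e][0]), find(edges[e][1])
        if rk == rl:
            continue
        total = sum(hist[rk].values()) * sum(hist[rl].values())
        same = sum(c * hist[rl][s] for s, c in hist[rk].items())
        counts[e] = (same, total - same)
        hist[rk].update(hist[rl])  # Counter.update adds the counts
        root[rl] = rk
    return counts


def malis_cost(labels, edges, affinities, loss):
    """MALIS cost: loss at each maximin edge, weighted by its pair counts."""
    n = len(labels)
    counts = maximin_pair_counts(labels, edges, affinities)
    total = sum(ns * loss(1, affinities[e]) + nd * loss(0, affinities[e])
                for e, (ns, nd) in counts.items())
    return total / (n * (n - 1) / 2)


# Chain of four pixels with a true boundary between pixels 1 and 2.
toy_counts = maximin_pair_counts([0, 0, 1, 1], [(0, 1), (1, 2), (2, 3)], [0.9, 0.8, 0.7])
toy_cost = malis_cost([0, 0, 1, 1], [(0, 1), (1, 2), (2, 3)], [0.9, 0.8, 0.7],
                      lambda x, x_hat: 0.5 * (x - x_hat) ** 2)
```

In this toy chain the three edges account for 1, 2 and 3 pixel pairs respectively, six in total, whereas the standard cost would weight each edge once.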



6 Online stochastic gradient descent 

Computing the cost function or its gradient requires finding the maximin edges for all pixel pairs. 
Such a batch computation could be used for gradient learning. However, online stochastic gradient 
learning is often more efficient than batch learning [13]. Online learning makes a gradient update of
the parameters after each pair of pixels, and is implemented as described in the box.



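Maximin affinity learning

1. Pick a random pair of (not necessarily nearest neighbor) pixels $i$ and $j$ from a randomly drawn
   training image $I$.

2. Find a maximin edge $mm(i,j)$.

3. Make the gradient update with learning rate $\eta$:

   $$W \leftarrow W - \eta \nabla_W \, l\left( \delta(s_i, s_j), A_{mm(i,j)}(I; W) \right)$$

Standard affinity learning

1. Pick a random pair of nearest neighbor pixels $i$ and $j$ from a randomly drawn training image $I$.

2. Make the gradient update with learning rate $\eta$:

   $$W \leftarrow W - \eta \nabla_W \, l\left( \delta(s_i, s_j), A_{ij}(I; W) \right)$$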


For comparison, we also show the standard affinity learning [9]. For each iteration, both learning
methods pick a random pair of pixels from a random image. Both compute the gradient of the weight 
of a single edge in the affinity graph. However, the standard method picks a nearest neighbor pixel 
pair and trains the affinity of the edge between them. The maximin method picks a pixel pair of 
arbitrary separation and trains the minimal affinity on a maximin path between them. 
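
A single maximin update can be sketched as follows. Everything about the classifier here is an assumption made for illustration: we use a toy logistic model, affinity = sigmoid(w . x) with one hypothetical feature vector x per edge, in place of a convolutional network, together with the square loss. The names `find_maximin_edge` and `malis_step` are hypothetical, and the two pixels are assumed to be distinct.

```python
import numpy as np


def find_maximin_edge(num_pixels, edges, affinities, i, j):
    """Index of a maximin edge for distinct pixels i and j (None if no path)."""
    root = list(range(num_pixels))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    # Add edges from highest to lowest affinity; the edge that first puts
    # i and j in the same component is the minimal edge of a maximin path.
    for e in np.argsort(affinities)[::-1]:
        k, l = edges[e]
        root[find(k)] = find(l)
        if find(i) == find(j):
            return int(e)
    return None


def malis_step(w, features, edges, labels, i, j, lr=0.1):
    """One online update for the toy classifier A_e = sigmoid(w . features[e])."""
    affinities = 1.0 / (1.0 + np.exp(-features @ w))
    e = find_maximin_edge(len(labels), edges, affinities, i, j)
    if e is None:
        return w
    target = float(labels[i] == labels[j])  # delta(s_i, s_j)
    a = affinities[e]
    # Gradient of the square loss 0.5 * (target - a)^2 with respect to w.
    grad = (a - target) * a * (1.0 - a) * features[e]
    return w - lr * grad


# Three pixels in a chain; pixels 0 and 2 belong to different segments.
toy_edges = [(0, 1), (1, 2)]
toy_features = np.array([[1.0, 0.0], [0.0, 1.0]])
w_new = malis_step(np.array([1.0, 0.5]), toy_features, toy_edges, [0, 0, 1], 0, 2)
```

Here the pair (0, 2) is wrongly connected through the weaker edge (1, 2), so only the weight feeding that edge is decreased; the stronger, correct edge (0, 1) is left untouched.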



Effectively, our connected components performs spatial integration over the nearest neighbor affinity 
graph to make connectivity decisions about pixel pairs at large distances. MALIS trains these global 
decisions, while standard affinity learning trains only local decisions. MALIS is superior because it 
truly learns segmentation, but this superiority comes at a price. The maximin computation requires 
that on each iteration the affinity graph be computed for the whole image. Therefore it is slower 
than the standard learning method, which requires only a local affinity prediction for the edge being 
trained. Thus there is a computational price to be paid for the optimization of a true segmentation 
error. 

7 Application to electron microscopic images of neurons 

7.1 Electron microscopic images of neural tissue 

By 3d imaging of brain tissue at sufficiently high resolution, as well as identifying synapses and
tracing all axons and dendrites in these images, it is possible in principle to reconstruct connectomes,
complete "wiring diagrams" for a brain or piece of brain [19, 4, 21]. Axons can be narrower than
100 nm in diameter, necessitating the use of electron microscopy (EM) [19]. At such high spatial
resolution, just one cubic millimeter of brain tissue yields teravoxel-scale image sizes. Recent
advances in automation are making it possible to collect such images [19, 4, 21], but image analysis
remains a challenge. Tracing axons and dendrites is a very large-scale image segmentation problem
requiring high accuracy. The images used for this study were from the inner plexiform layer of the
rabbit retina, and were taken using Serial Block-Face Scanning Electron Microscopy [5]. Two large
image volumes of $100^3$ voxels were hand segmented and reserved for training and testing purposes.

7.2 Training convolutional networks for affinity classification 

Any classifier that is a smooth function of its parameters can be used for maximin affinity learning. 
We have used convolutional networks (CN), but our method is not restricted to this choice.
Convolutional networks have previously been shown to be effective for similar EM images of brain tissue
[11].

We trained two identical four-layer CNs, one with standard affinity learning and the second with
MALIS. The CNs contained 5 feature maps in each layer with sigmoid nonlinearities. All filters in
the CN were 5 x 5 x 5 in size. This led to an affinity classifier that uses a 17 x 17 x 17 cubic image
patch to classify an affinity edge. We used the square-square loss function
$l(x, \hat{x}) = x \cdot \max(0, 1 - \hat{x} - m)^2 + (1 - x) \cdot \max(0, \hat{x} - m)^2$, with a margin $m = 0.3$.
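
The square-square loss and its derivative with respect to the predicted affinity are simple to write down. The sketch below is a direct transcription of the formula above for scalar inputs; the function names are hypothetical, and `x` is taken to be the 0/1 connectivity target while `x_hat` is the classifier output.

```python
def square_square_loss(x, x_hat, m=0.3):
    """Square-square loss with margin m; x is the 0/1 target, x_hat the affinity."""
    return x * max(0.0, 1.0 - x_hat - m) ** 2 + (1 - x) * max(0.0, x_hat - m) ** 2


def square_square_grad(x, x_hat, m=0.3):
    """Derivative of the square-square loss with respect to x_hat."""
    return -2.0 * x * max(0.0, 1.0 - x_hat - m) + 2.0 * (1 - x) * max(0.0, x_hat - m)


# A connected pair predicted at 0.8 is already inside the margin: zero loss.
loss_inside_margin = square_square_loss(1, 0.8)
# A connected pair predicted at 0.5 falls short of 1 - m = 0.7 by 0.2.
loss_outside_margin = square_square_loss(1, 0.5)
```

The margin means that examples already classified with sufficient confidence contribute no gradient, so updates concentrate on pairs that are wrong or nearly wrong.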

As noted earlier, maximin affinity learning can be significantly slower than standard affinity learning, 
due to the need for computing the entire affinity graph on each iteration, while standard affinity 
training need only predict the weight of a single edge in the graph. For this reason, we constructed 
a proxy training image dataset by picking all possible 21 x 21 x 21 sized overlapping sub-images 
from the original training set. Since each 21 x 21 x 21 sub-image is smaller than the original image,
the size of the affinity graph needed to be predicted for the sub-image is significantly smaller, leading 
to faster training. A consequence of this approximation is that the maximum separation between 
image pixel pairs chosen for training is less than about 20 pixels. A second means of speeding up the 
maximin procedure is by pretraining the maximin CN for 500,000 iterations using the fast standard 
affinity classification cost function. At the end, both CNs were trained for a total of 1,000,000 
iterations by which point the training error plateaued. 

7.3 Maximin learning leads to dramatic improvement in segmentation performance 



[Figure 3 panels: A. Clustering accuracy; B. ROC curve; C. Precision-recall curve; D. Splits vs. mergers]

Figure 3: Quantification of segmentation performance on 3d electron microscopic images of
neural tissue. A) Clustering accuracy measuring the number of correctly classified pixel pairs. B)
and C) ROC curve and precision-recall quantification of pixel-pair connectivity classification show
near perfect performance. D) Segmentation error as measured by the number of splits and mergers.

We benchmarked the performance of the standard and maximin affinity classifiers by measuring
the pixel-pair connectivity classification performance using the Rand index. After training the
standard and MALIS affinity classifiers, we generated affinity graphs for the training and test
images. In principle, the training algorithm suggests a single threshold for the graph partitioning.
In practice, one can generate a full spectrum of segmentations leading from over-segmentations to
under-segmentations by varying the threshold parameter. In Fig. 3 we plot the Rand index for
segmentations resulting from a range of threshold values.

In images with large numbers of segments, most pixel pairs will be disconnected from one another,
leading to a large imbalance in the number of connected and disconnected pixel pairs. This is
reflected in the fact that the Rand index is over 95% for both segmentation algorithms. While this
imbalance between positive and negative examples is not a significant problem for training the
affinity classifier, it can make comparisons between classifiers difficult to interpret. Instead, we can use
the ROC and precision-recall methodologies, which provide for accurate quantification of the
accuracy of classifiers even in the presence of large class imbalance. From these curves, we observe that
our maximin affinity classifier dramatically outperforms the standard affinity classifier.
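
To make the pixel-pair view of precision and recall concrete, the sketch below treats "connected" as the positive class and compares a test segmentation with the ground truth. It is an illustration only: it enumerates all pairs, so it is meant for small examples, the function name `pair_precision_recall` is hypothetical, and returning 1.0 when a denominator is empty is a convention assumed here.

```python
from itertools import combinations


def pair_precision_recall(truth, test):
    """Precision and recall for classifying pixel pairs as connected."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        connected_true = truth[i] == truth[j]
        connected_test = test[i] == test[j]
        tp += connected_true and connected_test
        fp += connected_test and not connected_true
        fn += connected_true and not connected_test
    # Assumed convention: report 1.0 when there is nothing to be wrong about.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# A merger keeps recall perfect but hurts precision; a full split does the opposite.
merge_pr = pair_precision_recall([0, 0, 1, 1], [0, 0, 0, 0])
split_pr = pair_precision_recall([0, 0, 1, 1], [0, 1, 2, 3])
```

Sweeping the partitioning threshold and recording these two numbers for each resulting segmentation traces out a precision-recall curve of the kind shown in Fig. 3.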

Our positive results have an intriguing interpretation. The poor performance of the connected
components when applied to a standard learned affinity classifier could be interpreted to imply that 1) a
local classifier lacks the context important for good affinity prediction; 2) connected components is
a poor strategy for image segmentation since mistakes in the affinity prediction of just a few edges
can merge or split segments. On the contrary, our experiments suggest that when trained properly,
thresholded affinity classification followed by connected components can be an extremely
competitive method of image segmentation.

8 Discussion 

In this paper, we have trained an affinity classifier to produce affinity graphs that result in excellent
segmentations when partitioned by the simple graph partitioning algorithm of thresholding followed
by connected components. The key to good performance is the training of a segmentation-based cost
function, and the use of a powerful trainable classifier to predict affinity graphs. Once trained, our
segmentation algorithm is fast. In contrast to classic graph-based segmentation algorithms where
the partitioning phase dominates, our partitioning algorithm is simple and can partition graphs in
time linearly proportional to the number of edges in the graph. We also do not require any prior
knowledge of the number of image segments or image segment sizes at test time, in contrast to other
graph partitioning algorithms [17, 20].

Figure 4: A 2d cross-section through a 3d segmentation of the test image. The maximin
segmentation correctly segments several objects which are merged in the standard segmentation, and even
correctly segments objects which are missing in the groundtruth segmentation. Not all segments
merged in the standard segmentation are merged at locations visible in this cross section. Pixels
colored black in the machine segmentations correspond to pixels completely disconnected from their
neighbors and represent boundary regions.

The formalism of maximin affinities used to derive our learning algorithm has connections to
single-linkage hierarchical clustering, minimum spanning trees and ultrametric distances. Felzenszwalb
and Huttenlocher [7] describe a graph partitioning algorithm based on a minimum spanning tree
computation which resembles our segmentation algorithm, in part. The Ultrametric Contour Map
algorithm [1] generates hierarchical segmentations nearly identical to those generated by varying the
threshold of our graph partitioning algorithm. Neither of these methods incorporates a means for
learning from labeled data, but our work shows how the performance of these algorithms can be
improved by use of our maximin affinity learning.